Geodesic acceleration and the small-curvature approximation for 

nonlinear least squares 



(N 



O 



Mark K. Transtrum 
Department of Bioinformatics and Computational Biology, 
University of Texas M.D. Anderson Cancer Center, Houston Texas, U.S.a\\ 



O ■ James P. Sethna 

Laboratory of Atomic and Solid state Physics, 



Cornell University, Ithaca, New York 14853, USA 

It has been shown numerically that the performance of the Levenberg-Marquardt 
algorithm can be improved by including a second order correction known as the 

| geodesic acceleration. In this paper we give the method a more sound theoretical 

> ■ 

foundation by deriving the geodesic acceleration correction without using differen 



tial geometry and showing that the traditional convergence proofs can be adapted 
to incorporate geodesic acceleration. Unlike other methods which include second 
derivative information, the geodesic acceleration does not attempt to improve the 



>: 

Ov 
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^ ■ Gauss-Newton approximate Hessian, but rather is an extension of the small-residual 

approximation to cubic order. In deriving geodesic acceleration, we note that the 
small-residual approximation is complemented by a small-curvature approximation. 
This latter approximation provides a much broader justification for the Gauss- 
Newton approximate Hessian and Levenberg-Marquardt algorithm. In particular, 
it is justifiable even if the best fit residuals are large, is dependent only on the model 
and not on the data being fit, and is applicable for the entire course of the algorithm 
and not just the region near the minimum. 



*Elcctronic address: mkt26@corncll.edu 



2 



I. INTRODUCTION 



In this paper we consider the problem of minimizing a scalar function whose form is the 
sum of squares 

c(9) = 1 -J2 r m(e) 2 , (1) 
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where r : R/ v — > R M is an M-dimensional vector function of N parameters, 9. We refer to 
r{6) as the residuals and scalar function C(9) as the cost. Functions of this form often arise 
in the context of data fitting and represent an important class of problems as evidenced by 
the large number of software packages dedicated to their optimization. 

The structure of this particular problem lends itself to efficient optimization. In par- 
ticular, consider the Hessian matrix of second derivatives, necessary to implement a quasi 
Newton method: 

d 2 C fdr^drrn d 2 r m \ 

d9»d9 u ~ ^ \d9^ d9 u +rm d9,d9 v ) [ } 

_ ^2 dr ™ drm 



09^ 89 u 



(J T J) 



where in the last two lines we have applied the so-called Gauss-Newton or small-residual 
approximation and introduced the Jacobian matrix of first derivatives J. The approximation 
in the second line is usually justified by the hope that at a minimum of C(9), the individual 
residuals are small, so that the Hessian is dominated by the contributions from the first term. 
Numerically, this approximation is advantageous since it allows one to implement a quasi- 
Newton method by calculating only the first derivatives of the residuals. This approximation 
comes at the cost of storing the derivative information for each residual individually, but this 
is rarely a bottleneck on modern computers. Consequently, the functional form of Eq. (CQ) 
effectively allows one to estimate the Hessian matrix with the same information used to 
calculate the gradient. 

Applying a trust region method to the Gauss-Newton approximate Hessian results in the 



Levenberg-Marquardt algorithmjl 



f| which iteratively updates the parameters according to 



59 = - (J T J + XD T D) g, (3) 

where A is an appropriately chosen Langrange multiplier for the step bound 59 T D T D59 < A 2 
and g = J T r is the gradient. Because Levenberg-Marquardt is a quasi-Newton method, it 
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usually has very good convergence properties. In particular, if the small-residual approx- 
imation is good, convergence can be super-linear to a local minimum. Furthermore, for 
well-constructed choices of A and A, the method is globally convergent (3], 5]. 

Although the Levenberg-Marquardt algorithm has many desirable properties, it is not 
always ideal. Often data fitting problems have a cost function characterized by narrow, 
winding canyons. Although, asymptotically the algorithm may converge super-linearly, it 
may nevertheless spend an unreasonable amount of time navigating the winding canyon 
before it finally zooms into the minimum. This problem is typically more severe on prob- 
lems with many parameters, which in turn are often more computationally expensive to 
evaluate and lack good parameter estimates to use as starting points. An improved op- 
timization method which can find minima with fewer function evaluations (and especially 
fewer Jacobian evaluations) would be a welcome improvement. 

In order to help improve the Levenberg-Marquardt algorithm, the authors previously pro- 
posed the inclusion of a geodesic acceleration term in the algorithm {7, 8]. This correction 
was derived using an information geometric interpretation of the least-square problem and 
was justified based on the empirically observed small-extrinsic curvatures of the relevant 
manifolds. In this paper, we will see that the geodesic acceleration correction can be under- 
stood as a generalization of the Gauss-Newton method extended to cubic order. Although 
other methods exist which utilize higher-order information, geodesic acceleration is comple- 
mentary to these approaches as their primary motivation is to improve the estimate of the 
Hessian. By contrast, the geodesic acceleration assumes the Hessian estimate is adequate to 
proceed to higher-order. 

In this paper, we derive the geodesic acceleration in a geometric independent way (section 
|H| and prove that its inclusion in the Levenberg-Marquardt algorithm does not compromise 
its convergence properties (section HTT1) . From the explicit derivation in section m we see that 
geodesic acceleration can be understood as a continuation of the small-residual approxima- 
tion to higher-order terms. Indeed, the discarded terms are properly understood as the resid- 
uals coupled to the extrinsic curvatures of the Model Graph defined in references 8]. The 
small-curvature approximation, therefore, provides additional justification for the Gauss- 
Newton approximation and the geodesic acceleration correction. As we argue in section IIV| 
the small-curvature approximation is more useful than the small-residual approximation for 
many reasons. In particular, it is valid not only near a local minimum of the cost, but for 
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all parameter values; it is a property of the model and not the data being fit and so is valid 
even when the model cannot fit the data well; furthermore, numerical experiments on many 
models suggest the small-curvature approximation is nearly universally valid. 



II. DERIVATION 

In order to improve the efficiency of the Levenberg-Marquardt method, we propose mod- 
ifying the step to include higher order corrections in a numerically efficient manner. To 
derive this correction, consider the minimization problem of finding the best residuals with 
a constrained step-size. We write the dependence of the residual on the shift 59 as 

r(9 + 59) =r + J59 + 1/259 T K59 + ■ ■ ■ , (4) 

where J and K are the arrays of first and second derivatives respectively. We wish to 
minimize 

min (r + J59 + 1/2 59 T K59) 2 (5) 

with the constraint that S6 T D T D58 < A 2 . After introducing a Lagrange multiplier A for 
the constraint in the step size, the minimization becomes 

min (r + J 59 + 1/2 8B T K59) 2 + \56 T D T D56. (6) 

By varying 59 we find the normal equations: 

^ ^ JmpTm ^ (JmpJinv ^rn^m^iu A-D m ^-D ml /) 59 v 
m mu 

+ ^2 ( J mvKmna + 1/2 J m pK mua ) 59 v 59 a = 0, (7) 

vnva. 

where we have explicitly included all the indices to avoid any ambiguity. Since we constrain 
the step size, it is natural to assume that 59 is small, and we seek a solution of Eq. (0) as a 
perturbation series around the linearized equation: 

59 = 59 1 + <J0 2 + ■ ■ ■ . (8) 

Let 59\ be a solution of the linearized equation: 

56 1 = -(J T J + r T K + \D T D)- 1 J T r 
w -{J T J + XD T D)' 1 J T r J 
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where in the second line we have made the usual Gauss-Newton approximation. We do not 
actually discard the term involving K, as it will help to cancel out a higher order correction 
later in the derivation. We therefore set 



50i = -(J T J + \D T D)- l J T r, (9) 



which is the usual Levenberg-Marquardt step. 
With this definition of 50i, Eq. (JTJ) becomes 

mv 

50 lu 56 la + (r m K mfMa + W\J mv K mm ) 86 la = 0. (10) 

mua ma 

to second order, with the term r m K mna 56i a the term neglected by the Gauss-Newton ap- 
proximation at first order. 

We now turn our attention to the second term in parentheses in Eq. (fTUj) . Using the 
definition of 59\ = —{J T J + XD T D)~ 1 J T r, we can write 



^ ^ ^'m-^mfia ^ ^ S9x u J ^mv^m/Aa ^ ^ ^m^mfia ^ ^ (^ J A-D D^j ^ u J nu K, 



nfia 



m/3u 



^ ] fm I $mn ^ ^ JmfjjJ J + AZ) D^p u J nu J K na a- 
mn \ (Su / 



Since this term is proportional to the residuals, r m , it can be ignored using the usual small- 
residual arguments. However, we now make an appeal to geometric considerations by noting 
that 5 mn — Y1bi> Jm/3(J T J + D)pl J nu = P^ n is a matrix that projects vectors perpendic- 
ular to the tangent plane of the Model Graph as described in reference 8| . If the curvature 
of the model graph is small, then P N K « and this term can be neglected. We discuss the 
implications of this argument further in section IIVI 

Returning to Eq. ffTUj) . after ignoring the last term in parentheses, we find 

56 2 = -^(J T J + r T K + XD T Dy 1 J T r" 

« -1/2 {J T J + \D T DY 1 J T r\ (11) 

where we have introduced the directional second derivative = ^2 ^mnv56i u 56i u and 
in the second line made the usual Gauss-Newton approximation to the Hessian, giving 
the formula first presented in This formula was originally interpreted as the second 
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order correction to geodesic flow on the model graph, and so we refer to it as the geodesic 
acceleration correction. By analogy, we refer to the first order term as the geodesic velocity. 
The full step is therefore given by: 

69 = 69! + 69 2 = v8t + 1/2 a5t 2 . (12) 

As has been noted previously 3, |8[, although the geodesic acceleration correction includes 
second derivative information at each step of the algorithm, its calculation is not compu- 
tationally intensive. In particular, it only requires the evaluation of a directional second 
derivative, which is computationally comparable to a single evaluation of the residuals. In- 
deed, in the absence of an analytic expression, a finite difference estimate of the relevant 
second derivative can be found by a single function evaluation. In contrast, the Jacobian 
evaluation at each step is comparable to N function evaluations. Particularly for large prob- 
lems, the computational cost of including the geodesic acceleration is negligible compared 
the other elements of the algorithm. 

III. CONVERGENCE 

In order to show that geodesic acceleration does not impair the convergence guarantees of 
the Levenberg-Marquardt algorithm, we must make a few additional modifications. We first 
note that an algorithm that selects A directly requires a solution of Eq. (JSJ) given the step 
bound, i.e. find the value of A corresponding to the step bound A. This so-called subproblem, 
can be solved accurately and efficiently for the case K = 0, using the methods described by 
More|9[ and Nocedel and Wright |5[. When including geodesic acceleration however, such a 
simple solution does not exist. Indeed, the step size is no longer a monotonically decreasing 
function of A. Furthermore, accounting for the contribution from the second term requires 
additional function evaluations, making an accurate solution computationally expensive. 

Fortunately, convergence proofs for Levenberg-Marquardt do not require that this sub- 
problem be solved accurately. We therefore content ourselves by approximately solving the 
problem as follows: We first require that \60i\ < A. This step can be satisfied easily using 



the algorithm in references p, [9[. We next require that the relative contribution from the 
second order step be bounded 

9imJ 

<a (13) 
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for some a > 



17| . In practice we implement the requirement in Eq. ( |T3|) by rejecting 



all proposed steps for which it is not satisfied and decreasing A (or increasing A) until an 
acceptable step is generated, recalculating 89\ for the new A. This may appear inefficient 
since the method will on occasion reject steps that would have decreased the cost. However, 
this requirement adds stability to the algorithm, helping it to avoid undesirable fixed points 
with infinite parameter values 

uedandnu—y^in^encesflaQ. 
We need to make an additional assumption about the behavior of the model. Specifically, 

it is necessary to assume that the directional second derivative of the model is bounded: 

I Krn fJ ,vU tJ 'u u \ < k for any parameter-space unit vector u and some positive constant k. 

This assumption is necessary to guarantee that we can always find a step-size that satisfies 

Eq. (1131) . We do not anticipate this requirement to be a major restriction for the applicability 

of the algorithm as we discuss later in this section. 

With this additional assumption, we can show that our modified Levenberg-Marquardt 

algorithm enjoys the same global convergence properties as original algorithm. Indeed, the 

proof is nearly identical to that of the unmodified Levenberg-Marquardt. Our proof here 

follows closely that of Theorem 4.5 in Nocedal and Wright 5]. First we define the model 

function m k (59) by 

m k (S6) = l - (r k + J k 56) 2 (14) 



1 i i O f- _rn 1 



"Jin + -56 T JlJ k 56, 



and the reduction ratio p k by 



CW ~ C{9 k + 69k) 
Pk m k (0) - m k (59 k ) ' 1 j 



We now define an algorithm for which we will prove convergence. This algorithm is analogous 
to algorithm 4.1 in referencej^] which we recover if we were to set 592 = at each step. 
Algorithm 1 

Given A > 0, A G (0, A), and a > 
for k = 0,1,2,... do 

Calculate 59\ and 59 2 as described above 

Set 69 k = 86 x + S9 2 

if \59 2 \ > a|50i|/2 then 
Afc + i = \A k 
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9k+l — ®k 

else 

Evaluate pk as in Eq. ([15 
if p k < \ then 
Afc+i = ^A k 
else 



if p k > | and \59\\ = A fc then 

A fe+1 = min(2A fc ,A) 
else 

Afc+i = A fc 
end if 
end if 

if pk > then 

#fc+i = 9k + 86 k 
else 

Qk+l = 9k 

end if 
end if 
end for 

Before presenting our proof, first notice that by defining 9 = D6, the optimization prob- 
lem in 9 must at each step satisfy the bound \59i\ < A. Without loss of generality, we 
therefore assume that D D is the identity. This assumption essentially replaces the Jaco- 
bian matrix with J = JD~ X . With this additional assumption, we now present a Lemma 
that will be useful in proving convergence. 

Lemma 1 

Suppose that at some point 9 our function has (non-infinite) residuals r, Jacobian matrix 
J, and second derivative array K, which satisfy \J T J\ < ft and \K mfll/ u fJ- u u \ < n for any 
parameter space unit vector u. Given positive constants a > and ( > 1, then if 

CA - /irrfr-ufl ' (16) 

y/pK\r\/a + p 

\89 x \ < CA, (17) 
then l^l/l^il < a/2, where g = J T r is the function's gradient. 
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Proof 

Let A denote the Lagrange multiplier associated with the constraint in Eq. (fTTj) . Then 
CA > \66x\ = \{J T J + A) -1 ^ > \g\/(p + A), from which it follows that A > |#|/(CA) - > 
\J 0n\r\/a. Notice that since \J T J\ < 0, the largest singular value of J must be less than 
yffi. We therefore have 

= | (J T J + Xy 1 J T r\ 

< K^J + xy^Wrl 

< #1. (18) 



Similarly, 



A 



m < ^i^ii 2 . (19) 



Combining these results gives us l^l/l^il < 0n\r\/2\ 2 < a/2. 

With this Lemma, we are prepared to prove our main result: that including geodesic ac- 
celeration does not affect the convergence properties of the Levenberg-Marquardt algorithm. 

Theorem 1 

Suppose that \J T J\ < for some positive constant 0, that C is Lipschitz continuously 
different iable in the neighborhood S(Rq) for some Rq > 0, and that \S9i\ < (A k for some con- 
stant C > 1 at each iteration. Also assume that m k (0) — m k (59) > C\\g k \ min(Afc, \gk\/\ J T J\) 
for some positive constant C\ G (0, 1] (g k = J k r k ), and that the second directional derivative 
of r k in any unit direction u is bounded | ^ K mfJlv u^u u \ < k for some positive k. Then 
using Algorithm 1, 

lim inf|^ fc | = 0. (20) 

k— >oo 

Proof 

The proof is nearly identical to that of Theorem 4.5 in reference We repeat the proof 
here in order to highlight the small-differences, leaving out the algebraic details. 
With some algebraic manipulation, we obtain 

■ _ x , \ m k (66 k )-C(9 k + 60 k ) 
k m k (0) - m k (59 k ) 

Using the same argument as in reference jjjjl, we have 

\m k (S9 k )-C(9 k + 69 k )\ < (0/2 + 1 )\S9 k \ 2 , (21) 
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where (3i is the Lipschitz constant for g on the set S(Rq). 

Suppose for contradiction that there is e > and a positive index K such that 



\g k \ > e, for all k > K, (22) 



then we have 



m k (0) -m k (59 k ) > c x \g k \ min(A fc , \g k \/\ J T J\) > ciemin(A fc , e//3). (23) 

By the workings of Algorithm 1, we see that p is only calculated if \59 2 \ < a\59 2 \/2. In 
this case we have \59\ = \S6i + 59 2 \ < \59i\ + \S9 2 \ < (1 + a/2)\59 1 \ < (1 + a/2)CA fc . Each 
step therefore satisfies 

1^1 < 7A fe (24) 

where 7 = (1 + a/2)( > 1. 

Using Eqs. (1211. ([23]l . and (I2B, we have 

fo-^^wmm^e/fiy (25) 

We now derive a bound on the right-hand-side that holds for all sufficiently small-values of 
A k , that is, for all A k < A, where A is defined as follows: 

A = ^ft y W T +ft ) -T , c^irI/a + J - (26) 



As noted in reference [5j, the -R0/7 term in this definition ensures that the bound in Eq. ( 12~D 
is valid because < 7Afc < 7 A < Rq. The third term is necessary to show convergence 
when including geodesic acceleration. 

Note that since c\ < 1 and 7 > 1, we have A < e//3. The latter condition implies that 
for all Afc e [0, A], we have min(Afc, e/ j3) = A k , so from Eq. ( )25|) and (f26|) . we have 

\Pk ~ !| < ^ 

following the logic in reference j^j. Therefore, if A^ G [0, A], then > |. Furthermore, the 
third condition in Eq. ( J26|) implies that if A& e [0, A] then |5#2| < «|5^i|/2 by Lemma 1. 

Since, we have that if A^ G [0, A], then \59 2 \ < a\S9i\/2 and that p k > by the workings 
of Algorithm 1, A^ +1 > A^ whenever A k falls below the threshold A. Consequently, a 
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reduction in A k (by a factor of 1/4) can occur in the algorithm only if A*. > A. We 
conclude that 

A fc+1 > min(A fc , A/4) for all k > K. (27) 

Suppose now that there is an infinite subsequence K, such that p k > 1/4 for k G /C. For 
k G K. and k > K, we have from Eq. ( )23l) that 

C(9 k ) - C(6 k+1 ) = f(9 k )-f(9 k + 69 k ) 

> hjn k {<S)-m k {89 k )\ 

> -ciemin(A fe ,e//?). 
Since C is bounded below, it follows from this inequality that 

lim A fc = 0, 

contradicting Eq. (l27j) . Hence no such infinite subsequence /C can exist, and we must have 
p k < \ for all k sufficiently large. In this case, A k will eventually be multiplied by \ at every 
iteration, and we have lim^oo A^ = 0, which again contradicts Eq. (127|) . Hence our original 
assertion, Eq. ( )22|) must be false, proving the theorem. 

Although Theorem 1 and its subsequent proof are nearly identical to Theorem 4.5 in 
Nocedal and Wright jsj, we now briefly discuss its implication. In particular, the proof 
is that the algorithm converges to a point where cost has zero gradient. This does not 
automatically imply convergence in the parameters unless the Hessian matrix is bounded 
from below. In fact, the Hessian matrix can be very ill-conditioned, particularly for models 



with many parameters 111, LL2[ . Although the inferred parameters in such problems may be 
ill-conditioned, the convergence properties of the algorithm are nevertheless robust. 

The requirement that m k (0) — m k {59) > ci\g k \ min(Afe, \g k \/\J T J\) at each step of the 
algorithm in this context is given without motivation. However, just like the analogous the- 
orem in reference jlj], this requirement is necessary to guarantee that the parameter iterates 
do not accumulate at points with nonzero gradient. We note that in practice this condition 
is never explicitly checked. 

n 

The only new assumption of Theorem 1 beyond those for the standard algorithms [5[ is 
the bound on the second derivative. Since we are utilizing second derivative information, 
this addition is not unexpected and guarantees that the new algorithm is well-behaved. It 
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is also clearly analogous to the bound on the first derivative used in the first algorithm 
(\J T J\ <= (3, equivalent to \Ju\ <= y/J$). The standard bound on the first derivative 
already excludes models where the cost diverges at interior points in the domain of the 



model as well as points like C(6) ~ y/\9 — 8 \ where the cost would stay finite although the 
derivative diverges; the new bound on the second derivative additionally excludes cases like 
C(9) ~ \9-9 \ 3 / 2 , which should not arise often in practice. In our experience, many models 
do have costs which become singular at unphysical points (i.e., positive-only parameters set 
less than zero), which can be addressed by an appropriate reparameterization of the model 
(i.e., shifting to log parameters). When this can be done, it is also likely to improve the 
convergence rate of the algorithm. 

We have here shown convergence of geodesic acceleration for an algorithm that belongs 
to the broad class of trust region methods that operate by specifying a step bound A. 
Originally Levenberg-Marquardt was proposed as an algorithm that heuristically selected A 
directly rather than implicitly through A l|, |2[. For an algorithm that directly selects A, one 
can similarly show convergence Q]. The proof that geodesic acceleration does not impair the 
convergence of this class of algorithm is almost identical to that of the original theorem, just 
as the proof of Theorem 1 above is almost identical to that of Theorem 4.5 in reference^. 
We do not give the rigorous proof here, but merely note that under the same additional 
assumptions, convergence can be shown for these methods as well. 



IV. SMALL-CURVATURE APPROXIMATION AND RELATION TO OTHER 

METHODS 

In deriving the geodesic acceleration in section [Til we observed that the neglected terms 
were each proportional to the residuals and could be neglected in the small-residual approx- 
imation. We also noted that the neglected terms were also proportional to the extrinsic cur- 
vature on the model graph as described in reference 0, Q] • Indeed, the geodesic acceleration 
was originally understood as a small- curvature approximation rather than a small-residual 
approximation. These two approximations are complementary; in general only one of the 
two needs to be valid in order for the approximations to hold. 

In deriving the geodesic acceleration, the connection between the small-residual and small- 
curvature approximation only became apparent when considering cubic order terms. How- 
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ever, the equivalence of the two approximations can be seen in the neglected terms of the 
Gauss-Newton Hessian without considering the geodesic acceleration. At a fixed point of 
the cost, the residuals are perpendicular to the model manifold (a fact used elsewhere to 
justify a scale-free measure of convergence 1 1 31] ) . i.e. r n m ^2 m r m Pm n - Thus, as the algorithm 
approaches a minimum, the neglected term can be written as 

m ^ ran h 

As Eq. (128]) makes clear, near the best fit the nonlinear contribution to the Hessian includes 
only components perpendicular to the tangent plane and is thus proportional to the extrinsic 
curvature. 

The implication of Eq. (128j) is that the Levenberg-Marquardt algorithm may attain very 
good convergence rates even with large residuals so long as the extrinsic curvature is suffi- 
ciently small. However, the approximation in Eq. (128]) is only valid near a fixed point of the 
cost, just as is the small-residual approximation is only valid near the minimum. The advan- 
tage of identifying the equivalence of the small-residual and small-curvature approximations 
is that the latter is likely to have much wider applicability. In particular for data fitting, 
the small- curvature approximation is a feature of the model rather than the data. Thus, al- 
though the validity of the small-residual approximation cannot be identified without finding 
the best fit, the small-curvature approximation can be. Furthermore, the small-curvature 
approximation is likely to be an excellent approximation for most models. In particular, 
several examples in references js), [l^| exhibit extrinsic curvatures many orders of magnitude 
smaller than the magnitude of the bare nonlinearities. 

Although Eq. ( 128]) is only valid near a minimum, the approximation that led to the 
geodesic acceleration is valid over a much larger parameter range, and it is for this reason 
it is likely to be a useful modification. The problem with Levenberg-Marquardt is not that 
its asymptotic convergence rate is poor (super-linear convergence is satisfactory in most 
cases), but that it may spend an unreasonable amount of time navigating a narrow canyon 
before that convergence rate is realized. As originally suggested, the geodesic acceleration 
can speed up this process since it describes the curvature of the canyon. The algorithm can 
find the minimum more quickly by following the canyon with a sequence of parabolic steps, 
rather than linear steps. A numerical demonstration of this will not be given here, as it has 
already been shown elsewhere 
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There are many other methods which include second derivative information in order to 
improve the Levenberg-Marquardt algorithm. These approaches all try to improve the qual- 



ity of the Hessian estimate in some way 



15. 



16] . Thus, although geodesic acceleration may 



appear superficially similar to these approaches, it is actually quite different. The philos- 
ophy behind geodesic acceleration is to extend the small-residual/curvature approximation 
to higher order terms rather than estimate the neglected terms. Of course, the utility of 
such of an approach is ultimately measured by performance on real world problems. As has 
been shown elsewhere, geodesic acceleration can be very helpful on large problems for which 
Levenberg-Marquardt spends unreasonable time searching for a minimum before zooming 
into the best fit. 



V. CONCLUSION 

In this paper we have studied the geodesic acceleration correction to the Levenberg- 
Marquardt algorithm, originally suggested in references [?], S]. We have derived the correc- 
tion without the use of differential geometry and shown that it can be interpreted as an 
extension of the small-residual approximation used to estimate the Hessian matrix of a sum 
of squares. We have also seen that the small-residual approximation is complemented by the 
small-curvature approximation, which is likely to be applicable under much more general 
circumstances. 

We have seen that with just a few minor modifications, the geodesic acceleration cor- 
rection can be incorporated into a standard Levenberg-Marquardt routine with minimal 
computational cost and without sacrificing its convergence properties. Numerical experi- 



ments given elsewhere 

ay 

. llOj. suggest that the benefit/cost ratio of this improvement can 
be substantial on many optimization problems. 
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